Skip to content

feat: PDF ingestion via unpdf with page chunking#33

Open
RaviTharuma wants to merge 1 commit intoAVIDS2:mainfrom
RaviTharuma:feature/memorix-nmi-pdf-ingestion
Open

feat: PDF ingestion via unpdf with page chunking#33
RaviTharuma wants to merge 1 commit intoAVIDS2:mainfrom
RaviTharuma:feature/memorix-nmi-pdf-ingestion

Conversation

@RaviTharuma
Copy link
Copy Markdown
Contributor

Summary

Adds an memorix_ingest_pdf MCP tool that extracts text from PDF files using unpdf and stores each page (or chunk) as a memorix observation.

What's included

  • src/multimodal/pdf-loader.ts — PDF text extractor with:
    • Page-level chunking (one observation per page by default)
    • Configurable max pages
    • Graceful handling of empty/corrupted PDFs
    • Optional unpdf dependency (clear error if not installed)
  • src/server.ts — MCP tool registration (memorix_ingest_pdf)
  • tests/multimodal/pdf-loader.test.ts — 4 tests covering:
    • Multi-page extraction and observation storage
    • Empty PDF handling
    • Page limit configuration
    • Missing file error handling

Design decisions

  • unpdf is an optional peer dependency — tool returns clear error if not installed (npm install unpdf).
  • Each page becomes a separate observation for granular search/retrieval.
  • Stores as how-it-works observation type with PDF metadata (page number, total pages, filename) in facts.
  • No heavy native dependencies — unpdf is pure JS/WASM.

Tests

4 pass, 0 fail

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0e39c413c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

*/
export async function extractPdfText(input: PdfInput): Promise<PdfExtractionResult> {
// Dynamic import — unpdf is optional
const unpdf = await import('unpdf').catch(() => null) as {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Declare unpdf as an optional dependency

This new loader hard-depends on resolving import('unpdf'), but the package is not declared in Memorix’s dependency metadata, so default installs won’t provide it. In the documented global-install flow (npm install -g memorix), telling users to run local npm install unpdf still won’t make this import resolvable from the global package tree, so memorix_ingest_pdf will consistently fail for those users.

Useful? React with 👍 / 👎.

Comment on lines +58 to +60
const limit = Math.min(rawPages.length, maxPages);

for (let i = 0; i < limit; i++) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce integer maxPages before page loop

maxPages is treated as an arbitrary number and used directly in i < limit, which causes fractional values to round up in practice (e.g., maxPages=1.1 processes pages 1 and 2). Negative values also silently result in zero ingestion. Because this tool parameter represents a page count, it should be validated/coerced to a positive integer before computing the loop bound.

Useful? React with 👍 / 👎.

@@ -0,0 +1,32 @@
import { describe, it, expect } from 'bun:test';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Switch new test file to Vitest imports

The repository’s test runner is Vitest (npm test runs vitest run), but this file imports from bun:test. In the Node/Vitest environment this module is not available, so the new test suite cannot execute as part of normal CI/test runs.

Useful? React with 👍 / 👎.

@AVIDS2
Copy link
Copy Markdown
Owner

AVIDS2 commented Mar 30, 2026

Thanks for working on PDF ingestion — page-level storage is a good direction, but this branch still has a few hard blockers.

  1. The test file is using bun:test, while the repo runs vitest in CI.
  2. src/multimodal/pdf-loader.ts currently fails typecheck because unpdf is dynamically imported but not represented in a way TypeScript can resolve in this repo's current dependency/type setup.
  3. In src/server.ts, the object flowing into storeObservation is inferred as type: string rather than a proper ObservationType.

So at the moment this is not just a dependency-install issue — it needs one more pass to make the optional dependency story and the repo's type/test expectations line up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants